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Total singular value decomposition. 
Robust SVD, regression and location-scale 

William J.J. RejQ 
Abstract 

Singular Value Decomposition (SVD) is the basic body of many statistical algorithms 
and few users question whether SVD is properly handling its job. 

SVD aims at evaluating the decomposition that best approximates a data matrix, 
given some rank restriction. However often we are interested in the best components of 
the decomposition rather than in the best approximation. This conflict of objectives leads 
(■ — ■ us to introduce Total SVD, where the word "Total" is taken as in "Total" least squares. 

_*~*^ ' SVD is a least squares method and, therefore, is very sensitive to gross errors in the 

data matrix. We make SVD robust by imposing a weight to each of the matrix entries. 
£NJ ■ Breakdown properties are excellent. 

^ ' Algorithmic aspects are handled; they rely on high dimension fixed point computations. 
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1 Introduction 



The presented approach goes upside-down. Indeed, starting with our ultimate goal, SVD, we 
Q_4 ■ precisely define the ingredients we need and, then, we keep them in mind in the course of 

' actions. The two main principles will be 

C$ • • Be robust. This will be understood as limiting the effect of each observation on the 

| estimations in a way such that it cannot unduly pull on the result. 



• Keep it simple. We would appreciate to plunge the Singular Value Decomposition (SVD) 
problem in a M-estimation setup. However, this does not seem possible without imposing 
heavy restrictions on the data matrices. We will remain pragmatic and be oriented toward 
' the numerical aspects. 

O ' 

We observe that our natural use of weights is fully consistent with the theories developed 
in the field of robustness. Without any effort, we come with the concepts of Breakdown point 



and M-estimation. 

1.1 SVD 



There are many ways to look at data matrices and, thinking at principal components, most of 
the ways favor either the rows or the columns of the data matrix; they are either observation- 
or variable-oriented. Some analysis methods try to be more balanced, for instance the (non- 
robust) biplot as defended by Bradu and Gabriel (1978), Gabriel (1998), Gower (2004) or Le 
Roux and Gardner (2005). Considering a [m x n]-data matrix X, a low-rank approximation 
is estimated (typically of rank 2) and graphically presented. Strangely, the approximation is 
worked out without taking into account the scatter of the matrix entries. Let us explain. 
The singular value decomposition 

X « U A V' , p < min{m, n} 

[m x n] m X p] [p x p] [p x n] 

where 

U' U = V V = I p and A = diag{Ai,...,A p }, Ai > ... > A p > 
is conveniently worked out with the help of an intermediate step 

XnAB' = UAV' 
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and an alternating computational scheme. This paper is centered on the approximation 

X re A B' 

[m X n] [m X p] [p X n] 

and essentially ignores the last step, 

A B' = U A 

[m x p] [p x n] [m x p] [p x p] 

seeing that it does not involve any approximation and can be performed by any (least squares) 
method. Works on alternating computational schemes are referred to in Ke and Kanade (2005) 
and Croux et al. (2003), for instance, and have flourished since the criss-crossing of Bradu and 
Gabriel (1978). 

The goal is to estimate A B' as it appears in Eq. (pQ) by a minimisation procedure 

AB> = argmin A B , {|| X - A B 1 ||} (2) 
where || . || is a norm. This problem is solved by alternating between 

A being known, iP = argmin B , {|| X - A B' ||} (3) 

and 

B being known, A = argmin A {|| X - A B' ||} . (4) 

Unfortunately, the norm in Eq. ([2]) is ill-defined and, except in the least squares set-up, it 
has little to do with those of Eq. ([3D and Eq. ([4]). Moreover, the two latter are inconsistent 
seeing that, at the end of the convergence, the two errors with respect to the approximated 
entry Xy, namely 

[Xij - (A B%] given A and [X l3 - (A B%] given B 
often differ. In fact, when robust estimation is of concern, very generally, they differ. 

1.2 Regression 

Besides the inconsistency of norms, an other little observed hindrance appears in the solving 
of Eq. ([3]) and Eq. (j4]). For the former, we wrote A being known and, of course, it is estimated 
rather than known. Considering the n columns of X in Eq. ([3]), the n columns of B' are 
solution of 

A being estimated, (B').j — argmin^,-, . {|| Xj — A (£>')-j 11} 
where X — (x\, ...,x„). In the more familiar regression notation, this is 

D being estimated, (3 — argmin^ {|| y — D (3 ||} 
under an error-in- variables model. 

As seen by Huber (1973), the robust treatment of linear regression is a simple extension of 
his approach in (1972, 1996) in the location-scale set-up. 

1.3 The location-scale set-up 

This is at the root of all investigations in robustness and, in Section 2, we report a very 
simple-minded view that is exceptionally rich in spite of being little traditional. In some sense, 
we address the reader as if he was not already familiar with all developments made in the 
robustness field. 

The location-scale set-up is further extended to handle regressions and this let unexpected 
features appear in Section 3. Section 4 concludes the exposition of the approach with the SVD 
treatment. 



(1) 



v, 

[p X n] 
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2 The location-scale set-up 

2.1 The weighted approach. 

We consider a set of m observations {xi, x m } that are, as we all assume, distributed accord- 
ing to a so-called Gaussian distribution. We are interested in their location and their scatter, 
entities that can be assessed by their mean n and their standard deviation (again, 'as we all 
know'), namely by 



\ i=l 



n=—^Xi and s x = ( — - n) 2 ] . (5) 



Some might prefer to divide by (m — 1) rather than by m in the second expression but, 
for the time being, we consider this idea as a sheer refinement. Rather, we are concerned by 
the possibility of gross errors of estimation due to possible errors in our data set {x\, ...,x m }. 
This is what robustness is about. To avoid that a few observations pull to much the mean 
toward them, it suffices to limit their influences on, for instance, the sum X^I=i x i m Eq. (|5|) . 
Inserting weights u>i, the sum becomes J2iLi w i x i- 

Clearly, the weighting must have little effect on the 'central' observations and must bound 
the extreme observations. Assuming that we bound to some constant c, this gives 

—2—, if x <C m, 

n~x 1 1 

w(x) — ^ 1, if | a; — m| <C s, (6) 
ifx»m. 

Remark, already at this level, that the weights are allocated with help of the mean and standard 
deviation estimates. Further on, we apply the weight definitions 

w{x) = (1 + \u\ q y /q where u = , < q < oo (7) 

fci s x 

and, at the limit q = oo, 

w ( x ) = I \iu,\ Vu\ 5 !' where u 



i/l«l, if|«i>i. k lST : 

that imply for the above constant 

c = lim w(x) \x\ = k\ s x . 

\x— n\ — >oo 

The so-called 'tuning constant' k\ controls the level of robustness that is desired. The parameter 
q will be numerically tried at five different values, 1, 2, 4, 8 and oo, in order to select the most 
appropriate one. The weights of q = 1 were already implemented in the numerical package 
of Klema (1978) and they are presented by Coleman et al. (1980) under the name 'Fair'. At 
q — oo, the weighting is identical to what Huber (1964) found as optimal in the near- vicinity 
of the Gaussian distribution. 

Clearly having inserted weights, Eq. ([5]) must be adapted and, at first sight, the next 
writings could appear appropriate 



=^ ^2 w i x i and s l = v^rfT 2 i x i - n )f 



The above estimator of the population variance is severely biased on two grounds: On the one 
hand, it is greatly underestimating s 2 , due to systematically down-weighting the large terms 
(x — n) 2 and, on the other hand, we have omitted to take the correct number of degrees freedom 
into account. We now deal with these two issues. 

The severe under-estimation is taken into account by inserting a correction factor 

f°° W(x) 2 m{x)dx 2 ;n / 

fc 2 2 = J 2 Yw where rn(x)=exp- x / 2 /V2^. (8) 

J _ oc w(x) z x z m(x) dx 
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Rather than dividing by m, at least in the linear context, we know that we must take into 
account a reduced number of degrees of freedom. In this approach with weights, it is less clear 
how the situation must be handled. We have opted for a correction term jfzj where N stands 
for the 'effective' number of observations, the number of observations that play a role in the 
variance estimate. Very generally, whatever an item (ui) can be, a weighted average takes the 
form Ylj—i Wi (uj) /y^j_] Wi, therefrom come the expressions of the average weight w and the 
'effective' sample size N 

■ — > Wi (Wi) and N — — — 



i=l w i ~i w Ei=l< 

Further on, we compare estimators with respect to how many observations influence their taken 
values. Very simply, this can be measured by 

N 

Efficacy =— . (9) 

m 

It is now time to gather the various items we met with. The mean n and the scatter s x are 
the solution of a fixed-point problem where, given previous estimates (n, s x ), the weights and 
the new (n, s x ) are estimated according to 

Given n and s x , 

W{ = (1 + \u i \ q )~ 1 l q , where ttj = (xi — n)/(ki s x ), 
The unbiased estimator of the variance is 



(10) 



N 

s 



N - 1 



where N = 




,j=i 

For k\ — ► oo, the above expressions collapse in the familiar least squares formulae 

Whatever a priori values n and s x , Wi = 1 

11 = — ) i i x i 

s l = Er=i(^ ~ n ) 2 ; where fc 2 = i- 

We observe that the set of equations (|10p in terms of the two variables n and s x describes a 
contracting mapping when the weights satisfy Eq. ([6]) . Hence, the numerical convergence can 
easily be accelerated in the vicinity of the fixed-point solution. 

2.2 Theory and numerical experiments 

Some aspects of the theory will be briefly sketched and further details can be found in very 
many places as, for instance, in Rey (1978, 1983). They will often be supported by numerical 
experiments. 

2.2.1 Contamination model. 

Rather than assuming that the variates Xi are distributed according to a nominal probability 
density function / ma i n j Tukey (1960) considers that a few observations could possibly be drawn 
from an other distribution / ru bbishi 

XiOif, /= (1 - e) /main + e /rubbish where e <^- 
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The contaminated distribution is a mixture distribution. The objective of robust estimation 
consists in being as little sensitive as possible to / ru bbish- As is usual, we assume a Gaussian 
distribution and outliers 

/main = ^) and / ru bbish <> A/"(/i, cr 2 ). 



2.2.2 M-estimation. 

The above weights were introduced as an intuitive manner of limiting the influence of any 
extreme observation on the estimates of the mean and the variance. They also appear very 
naturally in the context of M-estimation. 

Many estimators can be seen as solutions of minimization problems and, in the context of 
location estimation, it takes the form 



# = argmin e \ £ p{ Xl - 6) \ . (11) 
Then, under differentiability conditions, 



. i=l 



^(xi - 9) = where = — p(u). 

i=l 

In order to let the weights appear, we transform the left hand member, 

m m r^, _ ff\~\ m 

i—1 i—1 L J i—1 

and see that the weights 

= K (12) 

naturally appears in a minimization framework. 

In order to have a minimum, Eq. ([lip must be convex and this is certainly realized when 
each of its terms in p{.) is convex. Then, seeing Eq. (|12p. the restriction ([6]) on the weights 
turns out. 

All least squares approaches are based on p(u) = u 2 and that corresponds with no weighting 
(or equal non-zero weights). 

2.2.3 Breakdown point. 

This is a concept developed by Hampel (1968) that has a natural motivation. Robust estimators 
remain stable in spite of (very) extreme observations in the sample. The breakdown is a measure 
of how many very extreme observations can be tolerated. 
To illustrate, consider the weighted mean estimator 



Em / , w i %i 

1=1 ' 



i=l 

and suppose that you add k very extreme points on the right side of n. Seeing Eq. ([6]), the 
estimator becomes 

n(m, k) = n » — — ^ 

Ei=i w i 



Wi Xi + k 



the approximation being valid as long as the point addition little influences the values of the 
weights Wi. The breakdown point, BP, is the fraction such that Offset„ be very large or, 
precisely 

BP a = lim — SU ch that {Offsets = a}. (13) 

nwoo m + fcgp 

The exact value of a is immaterial as indicated in NoteO 
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2.2.4 Numerical convergence 

How fast the iterative process (|10|) converges is of great practical importance. Noting by I the 
iteration number, we observe that the mapping 

- ->(n ,s x ) ->■ (n ,sj -> (n T ,8^ )->... -> (n ,s x ) 

is contracting. The rates b n and b s indicate the convergence speed, 

n l+1 - n°° « 6„ (n l - n°°) and 4 +1 - « 6, (4 - s~), (14) 

when process (|10P is not accelerated. In the near- vicinity of (n 00 ^^ ) and for symmetrical 
distributions, expressions (1141) are exact. 

2.2.5 Numerical experiments 

All the reported experiments are with respect to the standard Gaussian distribution and this 
is further documented in Note [I] The tabulated results are asymptotic, m — ► oo, as described 
in Note El 

It is time to focus on better specifying the weight function given by the family ([7]) . Clearly, 
any given weight definition has consequences on the number of outliers that can be tolerated 
without dramatically spoiling the estimations. Hence, the breakdown point appears to be a 
natural gauge. Of course, comparing the performances of various powers must be done under 
similar conditions. Table [T] proposes a summary of the observations and the power- value q = 4 
turns out to be an interesting selection, although results with q = 2 or q = 8 arc little different. 

Table 1: Selecting the power q in the weight definition (|7|). 



Efficacy 
Eq. (JHD 


q 

Eq. © 


BPi 
Eq. CED 


0.80 


1 


0.276 




2 


0.352 




4 


0.375 




8 


0.382 




oo 


0.383 


0.90 


1 


0.195 




2 


0.294 




4 


0.329 




8 


0.339 




oo 


0.342 


0.95 


1 


0.124 




2 


0.235 




4 


0.282 




8 


0.293 




oo 


0.297 



Table [2] reports further details with the selected weight definition, 

/ \ 4\-i/4 , x — n 

w(x) = (1 + u J where u — . (15) 

K\ s x 

The parameters fci and k% control the degree of robustness of estimation by Eq. (|10p. As 
indicated by the last two columns of Table [H the convergence of process (|10p is somewhat 
slow and acceleration is welcome. We observe at the last rows that a small loss of Efficacy 
already yields a serious protection against possible outliers. 

3 Regression 

As indicated in the introduction, we consider the regression model 

y = D /3 + error 

[n x 1] [n x p] [p x 1] [n x 1] 
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Table 2: Data with the weights of Eq. flUD 



Efficacy 




z„ 

K2 




D D 

on 


O n 


L 

Os 


Jiq. (IMP 


i^q. (1XJ) 




k\ x hi 


T?« ill 91 

Jiq. (|la|) 


_. nt Ah 
i^q. ([14J) 


£jq. (|14|) 


0.5 


0.0987 


3.7227 


0.3673 


0.4110 


0.7220 


0.4777 


0.6 


0.1497 


3.0541 


0.4571 


0.4062 


0.6810 


0.4723 


0.7 


0.2244 


2.5258 


0.5669 


0.3960 


0.6275 


0.4648 


0.75 


0.2760 


2.2954 


0.6336 


0.3881 


0.5933 


0.4592 


0.80 


0.3428 


2.0798 


0.7130 


0.3754 


0.5515 


0.4513 


0.85 


0.4336 


1.8730 


0.8122 


0.3575 


0.4984 


0.4391 


0.90 


0.5686 


1.6673 


0.9480 


0.3294 


0.4265 


0.4178 


0.92 


0.6456 


1.5825 


1.0217 


0.3138 


0.3893 


0.4043 


0.94 


0.7472 


1.4937 


1.1161 


0.2936 


0.3442 


0.3853 


0.96 


0.8942 


1.3973 


1.2494 


0.2662 


0.2869 


0.3563 


0.98 


1.1537 


1.2843 


1.4817 


0.2246 


0.2063 


0.3036 


0.99 


1.4227 


1.2116 


1.7238 


0.1885 


0.1458 


0.2510 


1.00 


oo 


1.0000 


oo 


0.0000 


0.0000 


0.0000 


0.6418 


0.1773 


2.8198 


0.50 


0.4028 


0.6605 


0.4695 


0.8202 


0.3758 


1.9955 


0.75 


0.3690 


0.5317 


0.4471 


0.9145 


0.6227 


1.6059 


1.00 


0.3184 


0.4001 


0.4084 


0.9601 


0.8948 


1.3970 


1.25 


0.2660 


0.2867 


0.3562 


0.9811 


1.1742 


1.2774 


1.50 


0.2216 


0.2009 


0.2994 


0.9907 


1.4516 


1.2056 


1.75 


0.1850 


0.1405 


0.2456 


0.9953 


1.7234 


1.1605 


2.00 


0.1559 


0.0991 


0.1985 



where the solution j3 satisfies 

D being estimated, (3 = argmin^ {|| y — D f3 ||} 

under an error-in-variables model. It is a topic that includes quite a few refinements that we 
here ignore inasmuch as feasible (by implicitely assuming independency properties). 

3.1 Generalised least squares. 

The ordinary least squares approach is based on the norm 

/ d[ 

\\y-D [3\\ 2 =(y-D p)' (y ~ D 0) = ^(y 4 - d[ f3) 2 with D - ... 

i=i \ d' n 

a norm where the row- vectors d\ are seen as fully known. The terms (yi — d[ (3) 2 measure the 
fit-errors and, when we desire to take into account the total error, we must add the errors due 
to the variability of the row- vectors d\. Namely, we must complete the terms into 

(Vi - d'i f3) 2 + (3' Si (3 where S t = Cov(dJ). 

Thinking at robustness, in the weighted case the form 

n n / n \ 

- d\ (3) 2 becomes £ wj ( Vi ~ d[ (3) 2 + £ w 2 sA f3 

8=1 i=l \i=\ ) 

and the above norm || y — D (3 \\ 2 definition eventually is modified into 

n 

{y-Df3)'W{y-D(3)+f3' S D p, where S D = J^ w i 

2=1 

Note that matrix W is with respect to the squared weights, 

W = diag{io?, ...,«£}. 
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Thus, the generalised least squares approach takes the form 

= argmin^ {(y — D 0)' W (y - D 0) + 0' S D 0} 
that yields to the estimator 

= J' 1 D' Wy 
with the next sandwich estimator of covariance 



Cov(/3) 



N 



N- 



i- 1 £ 



W: 



6idi Sift 



Cidi Sift 



J- 1 



where 



J = D' W D + S D = J2 w t l d i d 'i + S i\ and e i =Vi- d 'iP- 



The connections with ridge regression and with total least squares are evident. 

The derivation of above Cov(/3) has been made by the infinitesimal jackknife of Jaeckel 
(1972) according to Rey (1983, Eq. 4.36) and sandwich estimators are surveyed by Freedman 
(2006). Their lack of efficiency is notorious as investigated by Kauermann and Carrol (2001) 
and it can be associated to the high stochastic variability of the central factor between the 
curled braces. In order to stabilize this factor, we imply independence conditions similar to 
those assumed in ordinary linear regression. We approximate the central factor and eventually 
obtain the covariance estimator (see further details at Note|3| 



Cov(/3) 



N 



N -p 



j- 1 



(16) 



with 



N 



wr 



En a 



and 



En o 



3.2 Robust generalised least squares. 



The extension to the robust context is natural and follows closely the above presentation for 
the one-dimension location estimator by the set of equations fllO[> . 

Given previous estimates of [fi, s), the weights and the new (/3, s) are estimated according 

to 

Given f3 and s, 



Wi = (l + uf) 1,/4 , where i 

J = Eti^f [di d[ + S t ], 
= J- 1 D'Wy, 



(y-D 0)* 

k 3 s 



(17) 



The associated covariance estimator 



N 



Cov(/3) = kt jj— J- 1 \ < sHid'i + (Si0)(Si0)' 



J 1 where N 



En 



is corrected for underestimation. It turns out that the constants It3 of and &2 are the same as 
for the one-dimension location estimator. Hence, the values of Table[2]are relevant in regression 
as well. 
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4 Singular Value Decomposition, SVD 

We deal with the minimisation ([2]) , 

AB> = argmin A B , {\\ X - A B' ||} , (18) 
where the approximation has rank p 

X « A B' 

[m X n] [m X p] [p X n] 

As we know, matrices A and _B can be scaled and rotated in any convenient way; to limit this 
indeterminacy, we impose the weak condition that 

A be an orthonormal basis. 



4.1 Total SVD, in view of statistical applications 

First, in order to illustrate the inconsistencies that we mentioned in the introduction, we 
consider a simple numerical example. Let the data matrix to be analysed by singular value 
decomposition be next X . Except for a minor perturbation at level of the second decimal place 
and an outlier, it is of Rank 1; the last entry should have been X5 3 = 15. 



X 



( 1 


2 


3 


\ 




/ 


-92 


3 


-17 \ 


2 


4 


6 








48 


6 


-8 


3 


G 


9 




+ 0.001 




2G 


-4 


-64 


4 


8 


12 








8 


-2 


92 


V 5 


10 





/ 






17 


-3 


/ 



/ 0.908 
2.048 
3.026 
4.008 
\ 5.017 



2.003 
4.006 
5.996 
7.998 
9.997 



2.983 \ 
5.992 
8.936 
12.09 
/ 



Applying the ordinary (least squares) SVD, the approximation is A B' 

( 

d AB' = 



Ordinary SVD 
Rank 1, nonrobust 



X 



1.167 
2.364 
3.527 
4.741 
2.536 



2.326 
4.710 
7.027 
9.445 
5.053 



2.574 \ 
5.212 
7.777 
10.45 
5.592 ) 



The outlier has completely spoiled the evaluation. 

Initialising with this least squares approximation, we bring the attention on the two norms 



used in the evaluations of A and B' 
weights on the data entries, 



Running Eqs. (|17p with ks = 1.5, they induce robust 



B' = argmin B , {\\ X - A B' ||} 



weights = 



/ 0.931 


0.984 


0.999 \ 


0.997 


1.000 


0.997 


0.999 


0.988 


0.977 


1.000 


0.982 


0.980 


\ 0.039 


0.009 


0.010 / 



and 



A = ar gm \n A {|| X - A B' \\} 



weights 



/ 0.987 


0.975 


0.869 \ 


0.998 


0.949 


0.870 


0.997 


0.953 


0.868 


0.997 


0.956 


0.867 


\ 0.997 


0.955 


0.867 J 



Clearly these two norms do not match. Moreover, this does not help for better defining the 
norm of Eq. (|18|) . As will soon be seen and in order to avoid the above incompatibility between 
the norms, we allocate the same set of weights for the evaluations of 



B> = argmin B , {|| X- A B' ||} and A = argmin^ {|| X - A B' ||} . 
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The second issue we raised in the introduction is with respect to the decomposition into 
an A B' product. Each of the two components is estimated "as if the other one was properly 
"known" rather than "estimated" . 

We now further detail and, momentarily for the simplicity, we place ourself in the ordinary 
least squares context. The squared Frobenius norm of Eq. ([2]) can be written as the double 
summation 

{m n j 
i=l j=l J 

where the residuals of all approximated X-entries are minimized. Clearly, Eq. (|19p guarantees 
a good approximation, namely a good estimation of the A B' product. However in very many 
cases and, specifically for statistics in most low-rank reduction applications of SVD, we are 
interested in good estimations of A and B. The qualities of the two components have the 
utmost importances rather than the quality of their product. Then, it leads to minimize 

!m n I 
12 12 [foj - a 'i b ^ 2 + a 'i Cov ^ a * + b 'i Cov ( a *) M \ ( 2 °) 
i=l j = l J 

that has a total least squares flavour; the minimisations takes place on the p columns of A and 
the p rows of B' . Accordingly, we call "Total SVD", the singular value decomposition based 
on the norm Eq. (|20|) (or Eq. (|21[). further on). 

Seeing that the alternating process of Eq. ([3|) and Eq. ([4]) can yield informations on 
Cov(&j) and Cov(aj), the present set-up is less general although many of the arguments of 
Golub and van Loan (1980), most of them also reviewed in the classroom note of Nievergelt 
(1994), still hold. In some sense, Eq. (|20p describes a trade-off between best approximating 
and best estimating the components of the approximation. The increased complexity of Eq. 
(|20p compared to Eq. (|19|) induces practical difficulties that are already reported in Gabriel 
and Zamir (1979). In fact, the information we obtain on Cov(b^) and Cov(aj) is less rich than 
the full estimations of these matrices. For instance, the estimation of B by Eq. ([3]) provides n 
covariance matrices of dimensions [p x p], whereas Cov(bj) in Eq. (|20|) has dimensions [m x m\. 
This mismatch of dimensions clearly displays a difficulty; we estimate A and B column-wise, 
but we use them row- wise in Eq. (|20p . However we can make use of the variances of the A- 
and _B-entries, their covariances are not suitably available. This leads to substitute 

DiaCov(aj) = diag{cr 2 (ai j i), cr 2 (a m i)} in place of Cov(aj), 

and similarly for Cov(6j). Hence, Eq. (|2Q|) takes the form 

{m n 
12 12 i( Xi 3 ~ a ' 1 ^J') 2 + a ' 1 DiaCov(6j) a % + bj DiaCov(ai) bj] 
i=l 3 = 1 

' (21) 

Cancellating DiaCov(fej) and DiaCov(ai), Eq. (|2 1|> collapses into the ordinary problem state- 
ment given by Eq. ([2]) . 

4.2 Estimation of Total SVD 

In view of the past expositions, first we introduce the general algorithm in a few lines as being 
the solving of a fixed point problem. 

Given previous estimates of |j4, DiaCov(aj)j = i... m , B, DiaCov(6j)j = i... n |, the weights and 
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to 



the new |a, DiaCov(ai)i = i... TO , B, DiaCov(foj)j = i...„| are estimated according 
Given |a, DiaCov(a.i)i=i... m , B, DiaCov(6j)j = i... n j . 
Evaluate the weights . 

Given A and DiaCov(a.j)j = i... m , estimate £? and DiaCov(6 J ) J=1 „. 
Given £? and DiaCov(&j) J= i... n , estimate A and DiaCov(ai)i = i... m . 

We review the four steps and this gradually leads us to the compact algorithm stated by 
Eqs. (1261) . 



(22) 



Given |a, DiaCov(a,) l= i... m , DiaCov(&,) J= i... n j. 



In fact, the algorithm does not do any use of any assumed DiaCov(6j-) 3 - = i„, n . The past 
estimation of A is entered in the next algorithmic step and, for the re-estimation of A, 
we enter a DiaCov(6j)j=i... n as found while B was estimated. 

Hence, a better formulation would have been "Given | A, DiaCov(ai)i = i... m , , B |" . Taking 

into account the definition DiaCov = diag{<7 2 (ai.i), a 2 (a mj i)}, we observe that we only 
need to enter the variances of the A-entries. Eventually, the proper formulation is "Given 

B, sj". The last parameter, s, is incidental; it is entered to 
speed up the evaluation of the weights. 

• Evaluate the weights tuy. 

As indicated, we allocate the same set of weights for the evaluations of the two norms 

B 1 = argmin B , {|| X - A B' ||} and A = argmin^ {|| X - A B' ||} . 

They are derived from the m x n approximation residuals (X — A B')ij and depends on 
their scatter s in a way that is familiar to the reader. Formally, this gives the following 
algorithmic piece. 

Given |a, S,s| . 

Evaluate the residuals fk — (X — A B')^, 
Iterate on the expressions : 

w k = {i + [h/{k 3S )fy 1/ \ 
n = Efc=i w k] I E*=i w l\ > 

v = (m + n ~ ^i— ) 



1, 



(23) 



iV 



^ [Efe=i /i] / En 



Allocate the Wfc 



The iterations converge linearly and should be accelerated. 
Given A and DiaCov(aj)j=i... m , estimate B and DiaCov(6j) J= i...„. 

The generalised least squares context of Section lOl here is relevant and, without further 
delay, we present the corresponding algorithmic piece that solves 

S 7 = argmin B , {|| X — A B' \\} . 
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Given (a, <7 2 (aik)i=i...m,fe=i..p} , 
Successively, for each of the columns Xj of X, 

Si = diag{cr 2 (aji), . . . , (J 2 (a ip )} , Dim.: [pxp]. 

J = E"=i w ij [ a i a 'i + &]) Dim - : [PXP]- 
Wj = diag (wij, W n j) , Dim.: [mxm]. 



(-B')j = J^ 1 A' Wj aij, Dim. : [ P x 1], 



(24) 



^ J = (E™ 1 ^) 2 /E™ 1 



Cov [(B')j] = ki jfe J- 1 {YZi < tf<K< + (B')j]') } j-\ 

a 2 {b ]k ) = (Qov[{B') 3 ]) kk . 

The notation distinguishing rows from columns requiring some attention, as well as for 
convenience, we noted the matrix dimensions between square brackets. 

• Given B and DiaCov(6j) J= i... n , estimate A and DiaCov(ai)j=i... m . 

Except for matrix transpositions, we mimic very closely Eq. (1241) and present the corre- 
sponding algorithmic piece. It solves 

A 7 = argmin A , {|| X 1 - B A! ||}. 

Furthermore, we complete by imposing the condition that A be an orthonormal basis. 



Given ^B,a 2 (bjk)j=i... n ,k=i..p \ , 
Successively, for each of the columns Xi of X', 

■= diag{<T 2 (bji), ...,a 2 (b jp )} , Dim.: [pxp]. 
J = X)J=i u>ij [bj b' 3 + Sj], Dim. : [p x p]. 

= diag w n j) , Dim. : [n x n]. 

(A')i = J- 1 B' Wi Xi, Dim. : [ P x 1], 

«? = [E -=i *4 (*« - 6, (^')*) 2 ] / [E - =1 <4 

/ \ 2 

N, 



(25) 



(E"=i^) /E"=i<- 
Cov [(A')i\ = fc 2 2 ^ J- 1 {E - =1 ^ (s?Mj + [S, (A')i\ [S, (A'),}') } J -1 . 
a 2 {a lk ) = (tov[(A') l ])' kk 



Transform A into an orthonormal basis. 
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At the end of the convergence of Eqs. (|22p. the orthonormalisation of A becomes an 
identity operation. 

Summing up, the above algorithmic parts obtain the Total SVD solution as a fixed point 
of | A, er 2 (a l!fc )i=i...m,fc=i..p,,-B,sj according to 



Evaluate the weights Wij by Eqs. (123 
Estimate B and by Eqs. ([24]) . 
Estimate A by Eqs. ([25]) . 



(26) 



Before leaving this presentation, it must be noted that some difficulties of convergence can 
occur (see Note [5]). 

We apply the variants of the above algorithm on the [5 x 3] example and specially draw the 
attention on the value taken by £5.3 (that should be £5.3 = 15). We already met with the usual 
SVD result, 



Ordinary SVD 
Rank 1, nonrobust 



X 



AB' 



( 1.167 2.326 2.574 \ 

2.364 4.710 5.212 

3.527 7.027 7.777 

4.741 9.445 10.45 

\ 2.536 5.053 5.592 / 



Taking into account the limited precision of product components A and B adds little to the 
quality of the resulting approximation, 



Total SVD 
Rank 1, nonrobust 



X 



AB' = 



/ 1.078 


2.147 


2 


376 \ 


2.183 


4.349 


4 


813 


3.257 


6.489 


7 


180 


4.377 


8.721 


9 


651 


\ 2.342 


4.666 


5 


164 / 



However, becoming robust induces a major quality improvement, 



Ordinary SVD 
Rank 1, robust : kz = 1 



X 



A B' 



( 1.005 2.000 2.983 \ 

2.018 4.016 5.989 

3.012 5.995 8.941 

4.018 7.997 11.93 

\ 5.021 9.995 14.91 / 



and this is even better when Total SVD is estimated, 



total SVD 
Rank 1, robust : £3 = 1 



X 



AB 1 



( 


0.9990 


1.989 


2.987 \ 




2.009 


3.999 


6.006 




2.998 


5.969 


8.963 




4.034 


8.032 


12.06 


V 


5.020 


9.995 


15.01 / 



Thinking at biplot applications, we analyse the European Health and Fertility data treated 
at Section 5.1 of Croux et al (1993). It is their [16 x 9]-Table 1 set and clearly it has columns 
of very different scalings, 



European Data 



0.1 


48 . 


. 3440 


6 


1.8 


50 . 


. 2716 


7 


0.2 


47 . 


. 3593 


6 


1.9 


49 . 


. 3218 


8 


0.5 


51 . 


. 3499 


7 


0.2 


48 . 


. 3421 


5 



arXiv.org 0706.0096 



14 



The scalings are so different that it makes little sense to act as if some homoscedasticity could 
be assumed. Hence, we transform the data set and the columns become centered and with 
variances 1, 



European Data 



/ -0.8778 0.26812 
2.2616 1.4938 
-0.3821 -0.3447 



2.4268 0.88097 
0.11359 2.1067 
-0.3821 0.26812 



The treatment allocate a weight to each of the entries, 



0.43654 -0.0768 \ 
-2.2925 0.53751 
1.0132 -0.0768 



-0.4003 
0.65893 
0.36492 



1.1518 
0.53751 
-0.6911 



Total SVD 
Rank 2, robust : 



Weights 



0.95895 


0.77783 . 


. 0.94071 


0.98998 


0.27558 


0.32983 . 


. 0.20418 


1.00000 


0.79620 


0.97667 . 


. 0.56217 


1.00000 


0.98847 


0.99972 . 


. 0.99998 


0.99144 


1.00000 


0.21158 . 


. 0.65800 


0.94428 


0.99977 


0.96668 . 


. 0.93092 


0.93200 



and we do not observe any very small weight. We are entitled to say that no entry is clearly 
outlying. The 5 lowest weights are 



Weight 


Entry 


Country, Factor 


0.18181 


(7,9) 


H, baby underw. 


0.18522 


(2,7) 


AL, inhab. doc. 


0.20418 


(2,8) 


AL, calorie 


0.20607 


(13,3) 


SU, women % 


0.21158 


(15,2) 


YU, give birth 



Contrary to the observation of Croux et al (1993), we do not have to eliminate by down- 
weighting the full fourteenth row corresponding to Turkey; what properly fits with the Rank 
2 decomposition, properly contributes to our estimation. 

Of course, the situation becomes more extreme when we increase the level of robustness. 
Now we clearly find outlying entries, 

/ 



Total SVD 
Rank 2, robust : ka 



0.5 



Weights 



0.16691 


0.18247 . 


. 


27318 


0.25611 


0.04368 


0.06851 . 


. 


04143 


0.60169 


0.26271 


0.79319 . 


. 


12728 


0.99098 


0.07989 


0.99997 . 


. 


99992 


0.50327 


0.99110 


0.04645 . 


. 


14016 


0.30331 


0.99951 


0.16474 . 


. 


38905 


0.42217 



and the above 5 low weights decrease, 



Weight 


Entry 


Country, Factor 


0.03469 


(7,9) 


H, baby underw. 


0.03422 


(2,7) 


AL, inhab. doc. 


0.04143 


(2,8) 


AL, calorie 


0.03915 


(13,3) 


SU, women % 


0.04645 


(15,2) 


YU, give birth 



As is ordinary in the field of robustness, the analyst must master the tools he uses. A com- 
plementary feature is how vary the eigenvalues of A = diagjAi, X p } in the ordinary SVD of 
the approximation A B'\ this is reported at Table [31 
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Table 3: Eigenvalues of A B' for the European Health and Fertility data. 





Rank 1 


Rank 2 


Rank 3 


Ordinary SVD 


8.5194 


8.5194 


5.7931 


8.5194 


5.7931 


3.2940 


Total SVD 














&3 = OO 


7.5963 


7.7213 


4.7238 


7.1071 


3.6495 


3.1878 


k 3 = 2.0 


7.4691 


7.5167 


4.5743 


7.1802 


3.6663 


3.1619 


k s = 1.0 


7.3074 


7.3593 


4.0388 


7.0150 


3.3955 


3.0195 


k 3 = 0.5 


7.2501 


6.4301 


3.1360 


6.3143 


3.098 


1.0491 



Numerical notes 

1. The Gaussian numerical sample. 

For a sample of size m, we represent the Gaussian distribution by its m quantiles defined 
as follows: we generate the points 

-1/2' 



such that 



(x) dx 



where 



2. Breakdown point BP a by Eq. ([13]). 

We add k points to a sample of the standard Gaussian, constructed as above by m 
quantiles. The k new points are located far away, namely at x — 10 6 . Then, Offset„ 
varies continuously with k and we can evaluate BP a for any a value. For numerical 
convenience, we take a = <j x and evaluate BPi. 

Clearly, the estimated BP a depends on the selected a-value. However, this dependence 
is very moderate. The curve BP a (a) is linear for small a and, fairly suddenly, displays a 
sharp increase that defines a quasi-asymptote. 

3. Asymptotic estimations. Consider an estimator t = t(m) that depends on the sample 
size m used to represent the standard Gaussian distribution. We assume the model 

t(m) = i(oo) + — 1 ±- 
m + t 2 

and expect the parameter t 2 to be small. 

The three sample sizes m — 100, 300 and 900 are used in the experiment. A model fit 
delivers the asymptotic value i(oo). 

In the linear regression context of Eqs. (I17p . rather than implementing with respect to 
the two factors of the product k-$ = k% x fc 2 , it is convenient to parametrize on &3 and 
approximate k 2 by an expression such as 

In k 2 w (0.4762- 0.8465 In fo + 0.4554 In 2 fc 3 )/(l - 0.3425 lnfc 3 ). 

4. Covariance estimator, Eq. (|16|) . 

In addition to the approximation 



Sip 



Si/3 



s 2 ^^ + {SifyiSipy 



where s 2 = e 2 , we modified the scaling factor obtained by the infinitesimal jackknife, 
N/(N — 1), into the N/(N — p) familiar in ordinary linear regression. The above ap- 
proximation and the latter modification have been supported by simulation runs with 
arbitrary weights and diagonal matrices Si . The covariance estimator Eq. (|16p tends to 



slightly overestimate the true covariance when the right term of 
dominates the left one, on the average. 



2 dX + (Sip)(Si$y 
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5. Computation of Total SVD by Eqs. (|26|) . 

The experimental estimation of the fixed point of Eqs. (|26[) has presented hazards. It 
sometimes occurs that the mapping contracts in a fairly small vicinity of the solution 

S, s| . Eventually, we have been solving 

[(xij - a- 6j) 2 + t (a- DiaCov(6j) a, + b'j DiaCov(a 4 ) bj)] 

for a parameter £ varied from t = to t = 1. We apply a predictor-corrector method of 
continuation built around the fixed point algorithm. 
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